1 Introduction

We were given a data set from the National Health and Nutrition Examination Survey (NHANES). NHANES is an ongoing, continuous series of surveys of the civilian non-institutionalized population of the United States, with data released by the CDC in two-year cycles; the data used here come from the 2009-2010 cycle. It is designed as a surveillance tool for specific diseases and behaviors, providing statistical insights into the U.S. population (Zipf et al. 2013). The survey program has assessed the health and nutritional status of adults and children in the United States since the 1960s, combining in-person interviews with physical examinations of participants, according to the CDC (2023a).

The survey data were not a simple random sample, however. According to the CDC’s National Health and Nutrition Examination Survey: Plan and Operations, 1999–2010 (Zipf et al. 2013), the sampling strategy consists of several stages:

  1. Selection of counties as primary sampling units (PSUs).

  2. Selection of segments within PSUs that constitute blocks of households.

  3. Selection of specific households within segments.

  4. Selection of individuals within a household.

In this assignment, we accessed the data set from the aplore3 package in R, which contains 6482 observations of 21 variables. Eight of these variables (gender, marstat, vigwrk, modwrk, wlkbik, vigrecexr, modrecexr, and obese) are treated as categorical variables, while the rest are numerical variables. We aim to study the relationship between the weight variable and the other health-related variables in the data.

2 Method

The weight variable was a continuous random variable in our data. A simple way of categorizing it was to use the BMI indicator. While other indicators such as waist circumference and waist-to-height ratio can give a better measurement of health conditions (Sardinha et al. 2016), these data were not available in the package, so we categorized weight by BMI level. The nhanes data from the package came with the indicator “obese”, a binary variable showing whether the participant's BMI was greater than a threshold of 35. However, a slightly finer way of categorizing weight is given by the CDC’s guideline (2022a). According to the CDC’s classification of body weight, we have: BMI \(<\) 18.5 as Underweight, BMI between 18.5 and 24.9 as Healthy weight, BMI between 25 and 29.9 as Overweight, and BMI \(\geq\) 30 as Obesity. In our assignment we explore both ways of categorizing weight, choosing whichever fits each analysis.
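The two categorizations described above can be sketched as follows. This is a minimal illustration and not the code used for the report; the function names are hypothetical, the category boundaries are assumed to be half-open (a BMI of exactly 18.5 counts as healthy weight, 25 as overweight, 30 as obesity), and the binary indicator is assumed to use a strict inequality at 35.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index in kg/m^2 from weight in kg and standing height in cm."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def bmi_category(bmi_value: float) -> str:
    """Four-level CDC adult category (boundaries assumed half-open)."""
    if bmi_value < 18.5:
        return "Underweight"
    if bmi_value < 25.0:
        return "Healthy weight"
    if bmi_value < 30.0:
        return "Overweight"
    return "Obesity"


def obese_indicator(bmi_value: float, threshold: float = 35.0) -> bool:
    """Binary indicator in the spirit of the package's `obese` variable."""
    return bmi_value > threshold


# Hypothetical participant: 70 kg and 175 cm.
example = bmi(70.0, 175.0)
print(round(example, 1), bmi_category(example), obese_indicator(example))
```

Note that the two schemes disagree for a BMI between 30 and 35, which the four-level scheme labels as obesity while the binary indicator does not.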

We began our study by doing an exploratory analysis among the variables through various tables and charts. We then performed several hypothesis tests to find the relationship between the weight variable and other variables.

3 Analysis

Table 3.1 below gives an overview of the variables, with weight categorized under the CDC’s approach. For categorical variables, the first four columns present the number of observations and the percentage within each BMI level, and the last column gives the total across the four levels and the percentage of all observations. For numerical variables, the first four columns give the mean and standard deviation within each BMI level, and the last column gives the overall mean and standard deviation. The rows show each variable name with its subcategories. If a variable has missing values, an additional row reports their count and percentage. The original data set contained 6482 observations, of which 37 had a missing value for the variable bmi. Because these amount to less than 0.6% of the data, we omitted them and took the remaining 6445 observations as the total sample size. However, some variables had a high percentage of missing values overall, the highest being Marital Status at 9.7%. Such a high percentage of missing values could affect an analysis, for example a linear regression model estimating the relationship between weight and the other variables. One could also question whether marital status really had any influence on body weight.

Several numerical variables call for explanation, namely the cholesterol levels and the blood pressure. Cholesterol is a type of lipid that helps the body perform important functions. The variable “tchol” is the Total Cholesterol, the total amount of cholesterol circulating in the blood. The variable “hdl” is the HDL-Cholesterol, the high-density lipoprotein cholesterol that helps transfer excess cholesterol from the blood to the liver (2022b). “sysbp” stands for Systolic Blood Pressure and “dbp” stands for Diastolic Blood Pressure. “Systolic pressure represents the maximum blood pressure when ventricles contract while diastolic pressure denotes the minimum blood pressure registered just before the subsequent contraction” (Walker, Hall, and Hurst 1990). The thresholds for cholesterol and blood pressure are discussed in Section 3.3. Two other numerical variables in this data set are worth noting, psu and strata. They are the Pseudo-PSU and Pseudo-stratum used in the sampling strategy. In this data set, strata were defined by geography and PSUs were selected from every stratum with probability proportional to a measure of size (PPS). More details about survey weights for NHANES are presented in the extra topic section (Section 3.5).

We also had several categorical variables describing work and recreational activities, each recorded at two levels of intensity, vigorous and moderate. In the survey design, vigorous activity was defined as intense physical exertion leading to a large increase in respiration or heart rate, sustained for at least 10 minutes continuously during work or recreational activities. Moderate activity was defined as activity requiring moderate physical effort and resulting in a small increase in respiration or heart rate, likewise sustained for at least 10 minutes continuously (2011).

Table 3.1: Characteristics of the data set NHANES
Healthy weight Obesity Overweight Underweight Overall
(N=1883) (N=2311) (N=2127) (N=124) (N=6445)
Gender
Male 897 (47.6%) 1036 (44.8%) 1171 (55.1%) 40 (32.3%) 3144 (48.8%)
Female 986 (52.4%) 1275 (55.2%) 956 (44.9%) 84 (67.7%) 3301 (51.2%)
Age (years)
Mean (SD) 41.2 (± 20.6) 48.7 (± 17.7) 48.9 (± 19.0) 37.9 (± 21.0) 46.4 (± 19.4)
Marital Status
Married 741 (39.4%) 1158 (50.1%) 1074 (50.5%) 31 (25.0%) 3004 (46.6%)
Widowed 121 (6.4%) 185 (8.0%) 190 (8.9%) 8 (6.5%) 504 (7.8%)
Divorced 154 (8.2%) 262 (11.3%) 210 (9.9%) 14 (11.3%) 640 (9.9%)
Separated 47 (2.5%) 82 (3.5%) 63 (3.0%) 1 (0.8%) 193 (3.0%)
Never Married 351 (18.6%) 353 (15.3%) 289 (13.6%) 30 (24.2%) 1023 (15.9%)
Living Together 141 (7.5%) 148 (6.4%) 157 (7.4%) 8 (6.5%) 454 (7.0%)
Missing 328 (17.4%) 123 (5.3%) 144 (6.8%) 32 (25.8%) 627 (9.7%)
Statistical Weight
Mean (SD) 36700 (± 26000) 33000 (± 25100) 34200 (± 26300) 37400 (± 27800) 34600 (± 25800)
Pseudo-PSU
Mean (SD) 1.51 (± 0.500) 1.50 (± 0.500) 1.51 (± 0.500) 1.50 (± 0.502) 1.51 (± 0.500)
Pseudo-stratum
Mean (SD) 7.11 (± 4.09) 7.36 (± 4.13) 7.15 (± 4.16) 7.80 (± 4.14) 7.22 (± 4.13)
Total Cholesterol (mg/dL)
Mean (SD) 185 (± 39.9) 194 (± 40.5) 198 (± 42.8) 172 (± 33.4) 192 (± 41.4)
Missing 123 (6.5%) 142 (6.1%) 121 (5.7%) 6 (4.8%) 392 (6.1%)
HDL-Cholesterol (mg/dL)
Mean (SD) 58.4 (± 17.1) 47.6 (± 13.7) 51.8 (± 15.5) 63.3 (± 17.1) 52.5 (± 16.0)
Missing 124 (6.6%) 142 (6.1%) 120 (5.6%) 6 (4.8%) 392 (6.1%)
Systolic Blood Pressure (mm Hg)
Mean (SD) 119 (± 18.5) 125 (± 17.3) 125 (± 18.5) 111 (± 18.5) 123 (± 18.3)
Missing 164 (8.7%) 206 (8.9%) 154 (7.2%) 20 (16.1%) 544 (8.4%)
Diastolic Blood Pressure (mm Hg)
Mean (SD) 67.4 (± 11.2) 71.3 (± 12.4) 69.8 (± 11.8) 65.7 (± 11.3) 69.6 (± 11.9)
Missing 167 (8.9%) 230 (10.0%) 170 (8.0%) 18 (14.5%) 585 (9.1%)
Weight (Kg)
Mean (SD) 63.1 (± 9.13) 99.0 (± 17.7) 77.3 (± 10.3) 47.9 (± 5.56) 80.4 (± 20.2)
Standing Height (cm)
Mean (SD) 168 (± 10.0) 167 (± 10.4) 168 (± 10.4) 166 (± 7.59) 167 (± 10.2)
Vigorous Work Activity
Yes 324 (17.2%) 418 (18.1%) 371 (17.4%) 16 (12.9%) 1129 (17.5%)
No 1558 (82.7%) 1893 (81.9%) 1756 (82.6%) 108 (87.1%) 5315 (82.5%)
Missing 1 (0.1%) 0 (0%) 0 (0%) 0 (0%) 1 (0.0%)
Moderate Work Activity
Yes 651 (34.6%) 796 (34.4%) 701 (33.0%) 32 (25.8%) 2180 (33.8%)
No 1231 (65.4%) 1515 (65.6%) 1426 (67.0%) 92 (74.2%) 4264 (66.2%)
Missing 1 (0.1%) 0 (0%) 0 (0%) 0 (0%) 1 (0.0%)
Walk or Bicycle
Yes 630 (33.5%) 549 (23.8%) 573 (26.9%) 48 (38.7%) 1800 (27.9%)
No 1252 (66.5%) 1762 (76.2%) 1554 (73.1%) 76 (61.3%) 4644 (72.1%)
Missing 1 (0.1%) 0 (0%) 0 (0%) 0 (0%) 1 (0.0%)
Vigorous Recreational Activities
Yes 579 (30.7%) 344 (14.9%) 449 (21.1%) 27 (21.8%) 1399 (21.7%)
No 1303 (69.2%) 1967 (85.1%) 1678 (78.9%) 97 (78.2%) 5045 (78.3%)
Missing 1 (0.1%) 0 (0%) 0 (0%) 0 (0%) 1 (0.0%)
Moderate Recreational Activities
Yes 834 (44.3%) 791 (34.2%) 823 (38.7%) 37 (29.8%) 2485 (38.6%)
No 1048 (55.7%) 1520 (65.8%) 1303 (61.3%) 87 (70.2%) 3958 (61.4%)
Missing 1 (0.1%) 0 (0%) 1 (0.0%) 0 (0%) 2 (0.0%)
Minutes of Sedentary Activity per Week (mins)
Mean (SD) 316 (± 185) 333 (± 186) 308 (± 184) 366 (± 195) 321 (± 186)
Missing 17 (0.9%) 34 (1.5%) 26 (1.2%) 1 (0.8%) 78 (1.2%)
Obese
No 1883 (100%) 1325 (57.3%) 2127 (100%) 124 (100%) 5459 (84.7%)
Yes 0 (0%) 986 (42.7%) 0 (0%) 0 (0%) 986 (15.3%)

3.1 Age and Gender

We began by analyzing the relationship between weight and the Age and Gender variables. The data set covers participants between 16 and 80 years old. Among them, the average weight of males was greater than that of females at all ages, and the line chart in Figure 3.1 shows that the change in average weight with age followed the same trend for both genders: a sustained increase, followed by fluctuation and finally a continuous decrease. This suggests that there might be some relationship between weight and age.

Figure 3.1: Average Weight in Different Age

To assess the relationship between weight and gender or age, we first categorized the numerical variables. As in Table 3.1, we used the BMI levels defined by the CDC to categorize weight in the following analysis. For age, we formed three groups based on NIH recommendations: younger than 18 for adolescents, 18 to 65 for adults, and older than 65 for older adults. We then created contingency tables for these variables and tested independence with both the Pearson chi-square test and the likelihood ratio test, comparing the results of the two methods.

Table 3.2: Contingency Table for BMI and Agegroup

                           Agegroup
BMI level       Adolescents   Adults   Older Adults
Underweight              17       86             21
Healthy                 188     1322            343
Overweight               74     1536            518
Obesity                  63     1765            512

Table 3.3: Contingency Table for BMI and Gender

                    Gender
BMI level       Male   Female
Underweight       40       84
Healthy          884      969
Overweight      1168      960
Obesity         1052     1288

Table 3.4: Chi-square test p-values of Independence Tests between Variables

          Weight vs. Age   Weight vs. Gender
p-value        2.543e-32           6.311e-13

Table 3.5: Likelihood ratio test p-values of Independence Tests between Variables

          Weight vs. Age   Weight vs. Gender
p-value        1.976e-29           5.224e-13

Under the chi-squared test, the p-value for weight and age was \(2.543\times10^{-32}\) and the p-value for weight and gender was \(6.311\times10^{-13}\). The p-values from the likelihood ratio test were slightly different, with \(1.976\times10^{-29}\) for weight and age and \(5.224\times10^{-13}\) for weight and gender. Both methods gave p-values much smaller than 0.001, so we rejected the null hypothesis of independence and concluded that there was an association between weight and age, and between weight and gender. One could further test for a linear relationship between weight and age given gender by fitting a linear model. Because of the trend seen in Figure 3.1, one might fit three linear models, one for each category of the age variable, or consider a polynomial regression model for the weight response on age; this is not discussed further in this assignment.
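The two test statistics behind these p-values can be sketched in a few lines. The snippet below is an illustration only, written with the Python standard library rather than the R code used for the report; the function name is hypothetical. It computes the Pearson statistic \(X^2=\sum (O-E)^2/E\) and the likelihood ratio statistic \(G^2=2\sum O\log(O/E)\) for the counts of Table 3.3, where \(E\) is the expected count under independence. Each statistic is then compared with a chi-squared distribution with \((I-1)(J-1)\) degrees of freedom to obtain a p-value.

```python
import math


def independence_statistics(table):
    """Return (Pearson X^2, likelihood ratio G^2, degrees of freedom)
    for a two-way table of counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero cell contributes nothing to G^2
                g2 += 2.0 * observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df


# Counts from Table 3.3 (rows: Underweight, Healthy, Overweight, Obesity;
# columns: Male, Female).
bmi_by_gender = [[40, 84], [884, 969], [1168, 960], [1052, 1288]]
x2, g2, df = independence_statistics(bmi_by_gender)
print(f"X^2 = {x2:.2f}, G^2 = {g2:.2f}, df = {df}")
```

The two statistics are asymptotically equivalent under the null hypothesis, which is why the two sets of p-values are close but not identical.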

3.2 Marital Status

We then analyzed the relationship between weight and marital status. The box plot in Figure 3.2 shows that the median weight is around 80 kg for every marital status, with widowed participants having the lowest weight among the six categories. The Married and Never Married categories contained more people heavier than 130 kg than the other categories; however, the outliers above 140 kg were rather similar across categories. From this box plot we therefore could not really see a difference in the weight distribution among the categories of marital status.

Figure 3.2: Boxplot of Weight for Different Marital Status

This prompted us to question whether there was really a relationship between the variables, so we tested independence using the chi-squared test. For simplicity we used the binary “obese” variable given in the data set, which defines obesity as a BMI \(> 35\). The resulting contingency table is Table 3.6.

Table 3.6: Contingency Table

                      Obesity
Marital Status      No    Yes
Married           2530    474
Widowed            418     86
Divorced           528    112
Separated          158     35
Never Married      863    160
Living Together    388     66

Table 3.7: p-values for Chi-squared Independence Test

          marstat
p-value    0.6894

Let \(X\) be the categorical random variable for Marital Status, with \(I\) levels, and \(Y\) the one for Obesity, with \(J\) levels. Assume a random sample of \(n\) trials, and define the count random variable \(N_{ij}:=\sum_{k=1}^n \mathbf{I}_k(X=i, Y=j)\), where \(\mathbf{I}_k\) is the indicator function for the \(k\)-th trial. The joint random vector \([N_{11}, ..., N_{IJ}]\) then follows a multinomial distribution with \(n\) trials and cell probabilities \(\vec{p}=[p_{11}, ..., p_{IJ}]\). Writing \(p_{i+}\) and \(p_{+j}\) for the marginal probabilities of \(X\) and \(Y\), our hypothesis test is therefore:

\[\begin{gather*} H_0: p_{ij}= p_{i+} \cdot p_{+j} ~ \forall i,j\\ H_1:p_{ij} \neq p_{i+} \cdot p_{+j} ~ \text{for some } i,j \end{gather*}\]

With a p-value of 0.6894, we concluded that there was not enough evidence to reject the null hypothesis. In other words, we could not conclude that there was a relationship between obesity and marital status, which is consistent with our initial impression from the box plot.

3.3 Cholesterol and Blood Pressure

Cholesterol is an essential fat in the body. We first carried out an exploratory analysis of the relationship between weight and the cholesterol variables. We adopted the CDC’s weight categories and plotted the mean levels of total cholesterol and HDL cholesterol in Figure 3.3. We found a slight positive relationship between body weight and total cholesterol and a negative relationship between HDL and body weight. Since total cholesterol is largely made up of HDL and LDL cholesterol, this suggests that the obese population tends to have a high level of LDL and a low level of HDL. Our analysis was in line with a more recent study on the effect of BMI on lipid profile in children and adolescents in Saudi Arabia (Milyani and Al-Agha 2019). In this study the researchers concluded that “High BMI was found to be associated with increased levels of LDL cholesterol and decreased levels of HDL cholesterol. No significant association between gender and changes in lipid profile was established (P = 0.898)”.

Figure 3.3: Mean Cholesterol level across Categories of Body Weight

We then categorized cholesterol and blood pressure into levels. According to the ATP III report of the National Cholesterol Education Program (Cleeman 2001), we have: tchol \(\geq\) 240 mg/dL OR hdl \(<\) 40 mg/dL as Dangerous cholesterol level; tchol between 200 and 239 mg/dL OR hdl between 40 and 59 mg/dL for males (50 to 59 mg/dL for females) as At Risk level; tchol \(<\) 200 mg/dL AND hdl \(\geq\) 60 mg/dL as Healthy level.

Blood pressure is typically categorized into stages based on the systolic (top number) and diastolic (bottom number) readings. According to the American Heart Association (2023b), we have: systolic level \(<\) 120 mm Hg AND diastolic level \(<\) 80 mm Hg as Normal blood pressure; systolic level between 120 and 129 mm Hg AND diastolic level \(<\) 80 mm Hg as Elevated blood pressure; systolic level between 130 and 139 mm Hg OR diastolic level between 80 and 89 mm Hg as Hypertension Stage 1; systolic level \(\geq\) 140 mm Hg OR diastolic level \(\geq\) 90 mm Hg as Hypertension Stage 2; systolic level \(>\) 180 mm Hg OR diastolic level \(>\) 120 mm Hg as Hypertensive Crisis.
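Because several of these stages are defined with OR conditions, a reading can satisfy more than one of them, so the order in which the rules are checked matters. The sketch below is one possible way to apply the thresholds above, assigning the most severe stage that applies; the function name is hypothetical, both readings are assumed to be non-missing, and this is not necessarily how the categories were coded for the report.

```python
def bp_category(systolic: float, diastolic: float) -> str:
    """Assign a blood pressure stage from readings in mm Hg,
    checking the most severe stage first."""
    if systolic > 180 or diastolic > 120:
        return "Hypertensive Crisis"
    if systolic >= 140 or diastolic >= 90:
        return "Hypertension Stage 2"
    if systolic >= 130 or diastolic >= 80:
        return "Hypertension Stage 1"
    # From here on the diastolic reading is known to be below 80.
    if systolic >= 120:
        return "Elevated"
    return "Normal"


# Hypothetical readings (systolic, diastolic).
for reading in [(118, 76), (125, 70), (125, 85), (150, 70), (185, 70)]:
    print(reading, bp_category(*reading))
```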

After categorizing cholesterol level and blood pressure, we created three contingency tables, Table 3.8, Table 3.9 and Table 3.10 below.

Table 3.8: Contingency Table: Weight vs. Cholesterol

                      Cholesterol
Weight        At Risk   Dangerous   Healthy
Healthy           811         294       471
Obese             856         776       363
Overweight        884         622       303
Underweight        44          10        46

Table 3.9: Contingency Table: Weight vs. Blood Pressure

                                              Blood Pressure
Weight        Elevated   Hypertension Stage 1   Hypertension Stage 2   Hypertensive Crisis   Normal
Healthy            234                    255                    200                    17      870
Obese              330                    522                    412                    10      721
Overweight         309                    383                    344                    24      749
Underweight         10                      9                      8                     1       72

Table 3.10: Contingency Table: Cholesterol vs. Blood Pressure

                                              Blood Pressure
Cholesterol   Elevated   Hypertension Stage 1   Hypertension Stage 2   Hypertensive Crisis   Normal
At Risk            420                    587                    448                    25     1115
Dangerous          302                    391                    338                    14      657
Healthy            161                    191                    178                    13      640

Table 3.11: p-values of Independence Tests between Variables

          Weight vs. Cholesterol   Weight vs. Blood Pressure   Cholesterol vs. Blood Pressure
p-value                1.475e-53                   5.723e-35                        3.286e-13

We again conducted hypothesis tests using the Pearson chi-squared test. From Table 3.11, the p-value for weight and cholesterol is extremely small, equal to \(1.475\times10^{-53}\), which suggests a statistically significant association between weight and cholesterol levels. For weight against blood pressure, the p-value was \(5.723\times10^{-35}\), which likewise indicates a significant association. Cholesterol level was also associated with blood pressure, with a p-value of \(3.286\times10^{-13}\).

However, there are potential confounding variables such as diet that are not in the data; a further study would need to collect such data and conduct a stratified analysis.

Overall, the results suggested that there was a significant association between weight, cholesterol, and blood pressure. The patterns observed in the contingency tables align with general health knowledge: obesity is a risk factor for both high cholesterol and high blood pressure.

3.4 Activities

We then analyzed the relationship between body weight and physical activity. Two types of activity measurements were given in the data, work activity and recreational activity, each at the intensity levels vigorous and moderate.

Figure 3.4: Boxplot of BMI for Different Recreational Activity Conditional on Work Activity

From the left panel of Figure 3.4, one can see that participants with vigorous recreational activities mostly have BMI values within the range of 18.5 to 30, regardless of work activity. Participants without vigorous recreational activities have systematically higher BMI values, again regardless of work activity. The right panel shows the same pattern for moderate work and recreational activities. Comparing the two intensity levels, there is no marked difference, except that the interquartile range (25% to 75% quantiles) of BMI lies slightly closer to the healthy interval defined by the CDC for vigorous recreational activities than for moderate ones. We define the risk \(p\) as the probability of having a BMI outside the range of 18.5 to 30, i.e. being either underweight or obese. Let \(p_{VigR=Y}\) be the risk for participants with vigorous recreational activities and \(p_{VigR=N}\) the risk for those without. We also define a binary random variable called “InRange” indicating whether the BMI lies within this range. We propose the following two statements:

  1. \(p_{VigR=Y} < p_{VigR=N}\), perhaps independent of Work Activities

  2. Intensity of recreational activities also plays a role in getting healthy weight

To support our first statement, we first tested the independence of InRange and each activity-related variable. Based on the p-values in Table 3.12, we rejected independence between InRange and the wlkbik, vigrecexr and modrecexr variables, while we did not have enough evidence to reject independence between InRange and the work activity variables (vigwrk and modwrk).

Table 3.12: p-values of Independence Tests between Different Variables and HealthyRange of BMI

             vigwrk      modwrk      wlkbik   vigrecexr   modrecexr
p-value       0.645      0.8284   2.138e-06   1.413e-22   5.188e-09

Then we computed the conditional odds ratios given each level of Vigorous Work Activity, together with the marginal odds ratio aggregated over Vigorous Work Activity.

Table 3.13: Conditional Contingency Table

          Recreational|VigW=Y
InRange   VigR=Yes   VigR=No
Yes            221       474
No              97       337

Table 3.14: Conditional Contingency Table

          Recreational|VigW=N
InRange   VigR=Yes   VigR=No
Yes            807      2507
No             274      1727

Table 3.15: Marginal Contingency Table, Aggregate VigW

          Recreational
InRange   VigR=Yes   VigR=No
Yes           1028      2981
No             371      2064

Table 3.16: OddsRatio for Conditional and Marginal Probability

                  Estimated Odds Ratio
VigW=Y                        1.619840
VigW=N                        2.028902
Aggregated VigW               1.918523

In all cases the odds ratio is greater than 1, indicating that vigorous recreational activity is associated with higher odds of having a BMI within the range, i.e. lower odds of a risky BMI, whether or not the participant also has vigorous work activity. This suggests a potential recommendation for those who work long hours without adequate physical activity: recreational activity seems to play a more important role than work activity in maintaining a healthy body weight.
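The odds ratios in Table 3.16 follow directly from the cross-product ratio \(ad/(bc)\) of each 2x2 table. A minimal sketch, using the counts of Tables 3.13 and 3.14 and a hypothetical helper function, is:

```python
def odds_ratio(table):
    """Sample odds ratio (a*d)/(b*c) for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Rows: InRange = Yes, No; columns: VigR = Yes, No.
vigw_yes = [[221, 474], [97, 337]]      # Table 3.13, given VigW = Y
vigw_no = [[807, 2507], [274, 1727]]    # Table 3.14, given VigW = N

# Summing the two conditional tables cell by cell gives the marginal table.
aggregated = [[y + n for y, n in zip(row_y, row_n)]
              for row_y, row_n in zip(vigw_yes, vigw_no)]

for label, table in [("VigW=Y", vigw_yes), ("VigW=N", vigw_no),
                     ("Aggregated VigW", aggregated)]:
    print(label, round(odds_ratio(table), 6))
```

With rows ordered as InRange = Yes, No and columns as VigR = Yes, No, a value above 1 means the odds of being in range are higher in the vigorous recreation group.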

We then checked whether the probability of being within the healthy range differed between intensities of recreational activity. Let \(p_{VigR}\) denote the probability of having a BMI inside the healthy range of 18.5 to 30 for participants doing vigorous recreational activities, and \(p_{ModR}\) the corresponding probability for moderate recreational activities. We want to test the hypothesis:

\[\begin{gather*} H_0: p_{VigR} \leq p_{ModR}\\ H_1:p_{VigR} > p_{ModR} \end{gather*}\]

Table 3.17: Frequency Table

InRange   Vigorous Recreational=Y   Vigorous Recreational=N   Moderate Recreational=Y   Moderate Recreational=N
No                            371                      2064                       828                      1607
Yes                          1028                      2981                      1657                      2351

Table 3.18: p-values for two-sample test

          VigR_Yes     VigR_No
p-value    3.4e-06   0.3830068

We obtained the maximum likelihood estimates of the parameters from the frequency table (Table 3.17) as:

\[\begin{gather*} \hat{p}_{VigR=Y} = \frac{1028}{1028+371}, ~ \hat{p}_{ModR=Y} = \frac{1657}{1657+828} \\ \hat{p}_{VigR=N} = \frac{2981}{2981+2064}, ~ \hat{p}_{ModR=N} = \frac{2351}{2351+1607} \end{gather*}\]

Our conclusion was that the probability of having a healthy weight was higher among participants with vigorous recreational activities than among those with moderate recreational activities. For the comparison among participants not doing these activities (VigR_No in Table 3.18), the test was inconclusive.
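As an illustration of how such a one-sided comparison of two proportions can be carried out, the sketch below uses a Wald z-test with an unpooled standard error and an upper-tail p-value. These choices are assumptions, since the report does not state which variant of the two-sample test was used, and the function name is hypothetical. The test also treats the two groups as independent samples, which is only an approximation here because the same participant can report both vigorous and moderate recreational activities.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """One-sided Wald z-test of H0: p1 <= p2 against H1: p1 > p2.
    Returns (z, upper-tail p-value); the standard error is unpooled."""
    p1 = x1 / n1
    p2 = x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    # Upper tail of the standard normal, P(Z > z), via the error function.
    p_upper = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_upper


# Counts from Table 3.17: participants in range out of each group.
z_yes, p_yes = two_proportion_z(1028, 1028 + 371, 1657, 1657 + 828)
z_no, p_no = two_proportion_z(2981, 2981 + 2064, 2351, 2351 + 1607)
print(f"activity = Yes: z = {z_yes:.3f}, p = {p_yes:.2e}")
print(f"activity = No:  z = {z_no:.3f}, p = {p_no:.3f}")
```

A pooled standard error or a continuity correction would give somewhat different p-values, so the printed values need not agree exactly with Table 3.18.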

In all, our analysis suggests that vigorous or moderate recreational activities tend to be associated with a healthy range of BMI, while moderate or vigorous work activities might not have a significant influence on body weight.

3.5 Extra topic: Survey weights in complex survey designs

Most large-scale surveys combine multiple sampling design techniques, such as stratification and cluster sampling, which enable researchers to obtain accurate estimates while catering to practical and cost considerations. Central to these designs is the concept of sampling weights, which are pivotal in ensuring unbiased estimation.

Sampling weights are defined as the inverse of the probability that a specific unit is selected into the sample. These weights adjust for design-imposed inequalities in selection probabilities and are used to compute point estimates.

Consider first stratified random sampling. The population \(U\) of size \(N\) is partitioned into strata denoted by \(U_1,...,U_h,...,U_H\). The size of the \(h\)th stratum is denoted by \(N_h\). In stratum \(h\), a random sample \(S_h\) of size \(n_h\) is selected according to a sampling design; here we use simple random sampling for simplicity. Under this design, the estimator of the population total can be written as follows (Lohr 2022):

\(\hat{t}_{str}=\sum_{h=1}^H\sum_{j\in S_h}w_{hj}y_{hj}\)

where \(w_{hj}=N_h/n_h\) is the sampling weight of the \(j\)th observation in the \(h\)th stratum and \(y_{hj}\) is its observed value. Note that the probability of selecting the \(j\)th unit in the \(h\)th stratum is \(\pi_{hj}=n_h/N_h\), so the sampling weight is the inverse of this probability \(\pi_{hj}\). The unbiased estimator of the population mean \(\bar{y}_U\) can also be written with sampling weights as follows (Lohr 2022):

\(\hat{\bar{y}}_{str}=\frac{\sum_{h=1}^H\sum_{j\in S_h}w_{hj}y_{hj}}{\sum_{h=1}^H\sum_{j\in S_h}w_{hj}}\)
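A small numerical sketch may make the two stratified estimators concrete. The strata sizes and sample values below are hypothetical and chosen only so that the arithmetic is easy to follow; the function name is likewise an assumption. Simple random sampling within each stratum is assumed, so every sampled unit in stratum \(h\) carries the weight \(N_h/n_h\).

```python
def stratified_estimates(strata):
    """Weighted estimators of the population total and mean.
    `strata` is a list of (N_h, sample values) pairs, one per stratum."""
    weighted_sum = 0.0   # sum of w_hj * y_hj over all sampled units
    weight_sum = 0.0     # sum of w_hj over all sampled units
    for stratum_size, sample in strata:
        weight = stratum_size / len(sample)   # w_hj = N_h / n_h
        weighted_sum += sum(weight * y for y in sample)
        weight_sum += weight * len(sample)
    return weighted_sum, weighted_sum / weight_sum


# Hypothetical population: two strata with N_1 = 100 and N_2 = 300.
toy_strata = [(100, [2.0, 4.0]), (300, [10.0, 12.0, 14.0])]
t_hat, y_bar_hat = stratified_estimates(toy_strata)
print(t_hat, y_bar_hat)
```

In this design the weights sum to the population size \(N\), so the denominator of the mean estimator equals 400 in the toy example.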

Cluster sampling is another sampling technique used in large-scale surveys when the population elements are dispersed and the fieldwork is costly. We sample primary sampling units (psu’s), which are often natural groupings of the population elements. In cluster sampling, \(N\) is the number of psu’s in the population, and the \(i\)th psu contains \(M_i\) elements. For the sample of psu’s, \(S\), we denote by \(n\) the number of psu’s in the sample. In two-stage cluster sampling, a sub-sample \(S_i\) of secondary sampling units (ssu’s) is chosen from the \(i\)th psu, \(i=1,...,n\). The sub-sample size is \(m_i\).

In the case of two-stage cluster sampling with equal probabilities, the sampling weight can be expressed as \(w_{ij}=1/\pi_{ij}=\frac{NM_i}{nm_i}\), where \(\pi_{ij}\) is the probability that the \(j\)th ssu in the \(i\)th psu is in the sample. For unequal probabilities, we need the probability that the \(i\)th psu is in the sample, \(\pi_i\), and the probability that the \(j\)th ssu is in the sample given that the \(i\)th psu is in the sample, \(\pi_{j|i}\). Then, the sampling weight is given by \(w_{ij}=1/(\pi_i\pi_{j|i})\).

With the sampling weights above, the estimator of the population total in cluster sampling can also be written in the following form (Lohr 2022):

\(\hat{t}=\sum_{i\in S}\sum_{j\in S_i}w_{ij}y_{ij}\)

and the estimator of the population mean:

\(\hat{\bar{y}}=\frac{\sum_{i\in S}\sum_{j\in S_i}w_{ij}y_{ij}}{\sum_{i\in S}\sum_{j\in S_i}w_{ij}}\)
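The same weighted estimators can be sketched for the equal-probability two-stage design, where each sampled ssu in psu \(i\) carries the weight \(NM_i/(nm_i)\). The numbers below are hypothetical and the function name is an assumption; the example is meant only to show how the weights enter the two formulas.

```python
def two_stage_estimates(num_psus_in_population, sampled_psus):
    """Weighted total and mean under equal-probability two-stage sampling.
    `sampled_psus` is a list of (M_i, sub-sample values) pairs."""
    n = len(sampled_psus)
    weighted_sum = 0.0
    weight_sum = 0.0
    for psu_size, subsample in sampled_psus:
        # w_ij = N * M_i / (n * m_i), the inverse selection probability
        weight = num_psus_in_population * psu_size / (n * len(subsample))
        weighted_sum += sum(weight * y for y in subsample)
        weight_sum += weight * len(subsample)
    return weighted_sum, weighted_sum / weight_sum


# Hypothetical design: N = 10 psu's, n = 2 sampled, m_i = 2 ssu's each.
toy_psus = [(20, [1.0, 3.0]), (40, [5.0, 7.0])]
t_hat_cluster, y_bar_cluster = two_stage_estimates(10, toy_psus)
print(t_hat_cluster, y_bar_cluster)
```

Here the larger psu receives the larger weight (100 against 50), so its observations count twice as much in both estimators.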

In essence, sampling weights in complex survey designs play a crucial role in ensuring that the survey results are both accurate and generalizable to the broader population.

4 Conclusion

In our analysis of the NHANES data set, significant associations were identified between body weight and age, gender, cholesterol levels, and blood pressure. Lifestyle factors, especially vigorous recreational activities, were linked to healthier BMI ranges, which underlines their importance in weight management. Marital status did not show a significant relationship with obesity, whereas age and the type of physical activity were associated with weight. However, these findings should be interpreted cautiously because of potential confounding variables not accounted for in the data set.

References

2011. https://wwwn.cdc.gov/nchs/nhanes/2009-2010/DIQ_F.htm.
2022a. https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html.
2022b. https://my.clevelandclinic.org/health/articles/11920-cholesterol-numbers-what-do-they-mean.
2023a. https://www.cdc.gov/nchs/nhanes/about_nhanes.htm.
2023b. https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings.
Cleeman, James I. 2001. “Executive Summary of the Third Report of the National Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults (Adult Treatment Panel III).” American Medical Association 285 (19).
Lohr, Sharon L. 2022. Sampling: Design and Analysis. CRC Press, Taylor & Francis Group.
Milyani AA, and Al-Agha AE. 2019. “The Effect of Body Mass Index and Gender on Lipid Profile in Children and Adolescents in Saudi Arabia.” Ann Afr Med 18 (1): 42–46.
Sardinha LB, Santos DA, Silva AM, Grøntved A, Andersen LB, and Ekelund U. 2016. “A Comparison Between BMI, Waist Circumference, and Waist-to-Height Ratio for Identifying Cardio-Metabolic Risk in Children and Adolescents.” PLoS One 11 (2).
Walker, H Kenneth, W Dallas Hall, and J Willis Hurst. 1990. “Clinical Methods: The History, Physical, and Laboratory Examinations.” Butterworths.
Zipf G, Chiappa M, Porter KS, et al. 2013. “National Health and Nutrition Examination Survey: Plan and Operations, 1999–2010.” National Center for Health Statistics 1 (56).